class: center, middle, inverse, title-slide .title[ # Inference with linear models ] .subtitle[ ## Lecture 5 ] .author[ ### Manuel Villarreal ] .date[ ### 08/28/24 ] --- ### Recap - In the last lecture we talk about how to add interaction terms to a linear model in R using the `lm()` function. -- - A visual inspection of the residuals using histograms, scatter plots, and auto-correlation plots, suggests that the model's residuals behave in a way that agrees with the multiple linear regression assumptions. -- - Mean close to 0, constant variance, low auto-correlation. -- - However, we also noted that the residuals of the model that does not include the interaction term also behave in a manner that's close to our assumptions. -- - The problem we face now is, How can we decide if adding an interaction, or any covariate for that matter, is supported by our data? --- ### Inference - The problem of selecting which covariates to add to a model, or which model to choose from a pool of candidates is known as an **Inference** problem. -- - Some possible solutions to this problem include: - Null Hypothesis Testing. - Model selection - Forward/Backward selection algorithms (please never use these!) - Bayesian inference. -- **And most importantly: Theoretical Constraints**. That is, we choose the covariates that are most relevant to our scientific question, or that are relevant according to a theory that we want to evaluate. --- ### Model selection - In this class we will focus on an alternative approach in classical statistics known as model selection -- - The key idea is that we want to frame the selection of which variables to keep in a model as a decision problem. -- - What should we look for in a model? -- - The first thing that we want is for our models to be accurate, that is, we want models whose "predictions" are close to our actual observations. --- ### Model selection - One way to measure the accuracy of a linear model is by using the residuals. -- - Remember that linear models "predict" the conditional expectation of a dependent variable given the values of an independent variable. -- - In our blood pressure example with only age as a predictor we said that: `$$\mathrm{E}\left(\text{blood pressure}_i | \text{age}_i\right) = \beta_0 + \beta_1\text{age}_i$$` -- - Therefore, we can think of the difference between our observations and the expected values (the line) as a measure of the accuracy of the model. --- ### Blood pressure example - Going back to our blood pressure example, let's look at the first two models we applied, one with age and the other with age and sex but with no interaction: `$$\text{blood pressure}_i = \beta_0 + \beta_1\text{age}_i + \epsilon_i$$` `$$\text{blood pressure}_i = \beta_0 + \beta_1\text{age}_i + \beta_2 \text{sex}_i + \epsilon_i$$` -- - It would be too complicated to compare the residual for each participant on each number, so instead let's take the average: ``` r rss_a <- mean((lm_a$residuals) ^ 2) rss_as <- mean((lm_as$residuals) ^ 2) ``` -- - **Notice** that we take the average of the residual squared, this is because, by construction, the residuals will always add to 0! --- ### Blood pressure example - The mean of the squared residuals for the model that only includes age was **13.03**, while the mean of the squared residuals of the model that includes both age and sex was **10** -- - This means that the mean error of the model that includes both age and sex is lower, but should we keep the second model. -- - Is having a lower error enough to justify the addition of a new covariate and parameter? -- - The short answer is no, it is not enough. -- - Why? Well the problem is that, whenever we add a new parameter (like the betas) to a model, it's error will always decrease. -- - This means that looking at the errors is not enough to decide which model we should keep. --- ### Model selection - In general with linear models, it will always be the case that adding more covariates will make the error smaller. -- - Therefore, in order for us to know if the difference between the mean squared errors of two models is enough to guarantee the additional parameter we need to take into account how many parameters are in the model. -- - This is sometimes referred to as model complexity, the key intuition is that under this perspective, adding a new parameter to a model will make it more complex. -- - A different way to think about it is that adding parameters makes the predictions of a model more "flexible" -- - In our example, adding the parameter associated with the covariate "sex" added a completely different line to our predictions (one for each sex in the sample). --- ### Model selection - Therefore, in order to choose between the two models that we have we need a method that will fulfill two conditions. -- - First, we want a method that assigns "better" values to models that have smaller errors (like the model with age and sex). -- - Second, we want a method that assigns "better" values to simpler models (models with less parameters). -- - One of the methods that meets both of these conditions and is typically used in the literature is the Bayesian Information Criterion. --- ### Bayesian Information Criterion. - The Bayesian Information Criterion (BIC) is a method used for model comparison that combines a measure of a model's "complexity" and the accuracy of a model. -- - For linear models the BIC is easy to calculate: `$$BIC(M_j) = \underbrace{n\ ln\left(\frac{\sum_i^n\hat\epsilon_i^2}{n}\right)}_{\text{error}} + \overbrace{k\ ln(n)}^{\text{complexity}}$$` -- - According to this function, as the error and the number of parameters of model `\(M_j\)` increases, so does the value of the BIC of that model. -- - In general, we will look for the model that has the lowest BIC from the ones that we are testing. --- ### Bayesian Information Criterion. - Let's start by computing the BIC of the model that only includes age as a predictor. `$$\text{blood pressure}_i = \beta_0 + \beta_1\text{age}_i + \epsilon_i$$` -- First, we can use the `lm()` function to get the model's residuals. ``` r linear_age <- lm(formula = blood_pressure ~ age, data = blood) ``` -- - Now the R object `linear_age` contains the outcome of the linear regression with age as a predictor, therefore, we can look at the model's residuals by calling `linear_age$residuals`. --- ### Bayesian Information Criterion. Using the residuals we can calculate the **error** component of the BIC for our model. -- First, we need the sample size, which in this case is equal to the number of rows in our data: ``` r n <- dim(blood)[1] ``` -- Then we can then get the mean squared error by taking the average of the residuals squared: ``` r epsilon_2_bar <- mean(linear_age$residuals ^ 2) ``` -- Finally, we need to take the natural logarithm of the mean squared errors and multiply it by the sample size. The `log()` function in R by default uses the natural logarithm ``` r error_bic_age <- n * log(x = epsilon_2_bar) ``` --- ### Bayesian Information Criterion. - We can compute the model's "complexity" by first counting the number of parameters (values we need to estimate) in the model. -- Remember that our current model is: `$$\text{blood pressure}_i = \beta_0 + \beta_1\text{age}_i + \epsilon_i$$` -- - In this case we have to estimate `\(\beta_0\)`, `\(\beta_1\)` and `\(\sigma^2\)` which is the variance of the errors. -- Thus, we can say this model has 3 parameters, although some people would only count the `\(\beta\)`'s. ``` r k_age <- 3 ``` --- ### Bayesian Information Criterion. Following the equation for the BIC we can calculate the complexity term as: ``` r complexity_age <- k_age * log(x = n) ``` -- Then, the model's BIC is equal to .center[ `\(BIC\left(\mathrm{M}_{age}\right) =\)` 529.37] -- - The problem with BIC's is that we can't interpret its value in isolation. -- - Instead, the BIC values are only meaningful relative to other BIC values. -- - In other words, we are only able to tell if one model is better in terms of it's BIC than another, but we can't tell if that model is **true** or not. --- ### BIC for MLR - Let's calculate the BIC value for the multiple linear regression model that includes both age and sex as independent variables but no interaction. -- - Like before, we have to start by "fitting" the model and calculating the mean squared residuals. -- ``` r linear_age_sex <- lm(formula = blood_pressure ~ age + sex, data = blood) epsilon_2_bar_agesex <- mean(linear_age_sex$residuals ^ 2) ``` - The main difference is that now the number of parameters in the model has increased from 3 to 4 (including the variance of the errors). -- ``` r k_age_sex <- 4 BIC_age_sex <- n * log(x = epsilon_2_bar_agesex) + k_age_sex * log(x = n) ``` --- ### Comparison BIC - Now that we have calculated the BIC values for each of the two models we can compare them. -- .pull-left[ `\(BIC\left(\mathrm{M}_{age}\right) =\)` 529.37 ] .pull-right[ `\(BIC\left(\mathrm{M}_{age,sex}\right) =\)` 481.71 ] -- - The BIC value of the model that includes only age as a coefficient was higher by 47.66 units in comparison to the model that includes both age and sex as coefficients. -- - This means that, when comparing only these two models, we should always prefer the one that has both covariates included in the model. -- - If this are the only two models that we want to test then we would be done and now we can interpret the parameters on the model and add the interpretation to our results section. --- class: middle, center, inverse ### Now it's your turn. --- ### Use of BIC - As we saw in our example we can use the BIC is a good alternative when we have to compare two or more linear modes that have been fit to the same data. -- - The BIC can also be used in other cases, for example, we will see that when using a generalized linear model we can also use the BIC. -- - However, the mathematical expression that we have used is only a special case of the "normality" assumption. -- - That means that we shouldn't use it if we do not believe that the errors follow a normal distribution. -- - Instead, the general form of the BIC asks us to use something called the likelihood. --- ### Likelihood - The likelihood of a model refers to the probability of observing the data that we actually observed conditional on the value of some parameter. -- - Like the conditional expectation that means that we can set the values of the parameters and calculate the value of the likelihood using some mathematical expression. -- - We will not go over how to do this calculation exactly, however, this will be the default method for calculating the BIC of a model in R. -- - For example, we can use the function `BIC()` in R in order to calculate the Bayesian Information Criterion of a model using the likelihood function. --- ### Example: BIC with R - For example, to get the BIC of the three models we have used this class using the `BIC()` function: ``` r bic_age <- BIC(object = lm_a) bic_age_sex <- BIC(object = lm_as) bic_age_sex_int <- BIC(object = lm_asi) ``` -- - With this method we get new values for the BIC of our models that are: `\(BIC\left(\mathrm{M}_{age}\right) =\)` 1096.94 `\(\quad\)` `\(BIC\left(\mathrm{M}_{age,sex}\right) =\)` 1049.28 `\(\quad\)` `\(BIC\left(\mathrm{M}_{age,sex,int}\right) =\)` 1038.16 -- - **Notice** that while the values have changed (the magnitude is not the same), we still have the same order between the 3 models, where according to the BIC we should prefer the model that includes all there parameters.